library(tidyverse)
library(RSQLite)
library(dbplyr)
library(janitor)
library(lubridate)
library(datasets)
library(ggthemes)
library(gganimate)
library(modelr)
library(broom)
library(ggfortify)
library(infer)
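library(MASS)
# MASS masks dplyr::select() when loaded after tidyverse; restore the dplyr version
select <- dplyr::select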
library(tseries)
library(forecast)
library(fable)
library(fabletools)
library(tsibble)
library(tsibbledata)
library(feasts)

1. Data Cleaning

Creating a connection to the SQLite database and loading the fires dataset

# Connecting

conn <- dbConnect(SQLite(), "raw_data/FPA_FOD_20170508.sqlite")
# Pulling all the names of the tables in the database file

as.data.frame(dbListTables(conn))
# Making fires dataframe

fires <- tbl(conn, "Fires") %>% collect()

Seeing what other useful information is in the database. The majority are part of the database structure and are not readable in R.

# EPSG worldwide geodetic parameter dataset system
spatial_ref <- tbl(conn, "spatial_ref_sys_all") %>% collect()

# National Wildfire Coordinating Group unit abbreviations 
NWGG <- tbl(conn, "NWCG_UnitIDActive_20170109") %>% collect()
# Disconnect

dbDisconnect(conn)

Selecting columns of interest

fires_small <- fires %>%
  select(NWCG_REPORTING_AGENCY, SOURCE_REPORTING_UNIT_NAME, FIRE_NAME,
         FIRE_YEAR, DISCOVERY_DATE, DISCOVERY_DOY, DISCOVERY_TIME, CONT_DATE,
         CONT_DOY, CONT_TIME, STAT_CAUSE_CODE, STAT_CAUSE_DESCR, FIRE_SIZE, 
         FIRE_SIZE_CLASS, LATITUDE, LONGITUDE, OWNER_CODE, OWNER_DESCR, STATE, 
         COUNTY, FIPS_CODE, FIPS_NAME, Shape)

fires_small <- clean_names(fires_small)

Changing some columns to be factors

fires_small <- fires_small %>%
  mutate(nwcg_reporting_agency = as.factor(nwcg_reporting_agency)) %>%
  mutate(stat_cause_code = as.factor(stat_cause_code)) %>%
  mutate(fire_size_class = as.factor(fire_size_class)) %>%
  mutate(owner_descr = as.factor(owner_descr)) %>%
  mutate(state = as.factor(state)) 

The discovery date is stored as a Julian day number, so I overwrite it with a calendar date built from the year and day-of-year columns. I also add a ‘month of year’ column for future use.

fires_small <- fires_small %>%
  mutate(date_origin = as.Date(paste0(fire_year, "-01-01"))) %>%
  # Day-of-year counts from 1, so subtract 1 to get the offset from 1 January
  mutate(discovery_date = as.Date(discovery_doy - 1, origin = date_origin)) %>%
  mutate(discovery_moy = month(discovery_date, label = TRUE)) %>%
  select(-date_origin)
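
Day-of-year values count from 1, so day 1 of a year is an offset of 0 days from 1 January. The following is a minimal base R sketch of that conversion on a few invented rows (the `toy` data frame is an assumption for illustration, not the fires data), which makes the leap-year behaviour easy to check:

```r
# Invented rows for illustration; column names mirror fires_small
toy <- data.frame(
  fire_year     = c(2015, 2015, 2016, 2016),
  discovery_doy = c(1, 60, 60, 366)
)

# 1 January of each fire year
origin <- as.Date(paste0(toy$fire_year, "-01-01"))

# Day-of-year is 1-based, so the offset from 1 January is doy - 1
toy$discovery_date <- origin + (toy$discovery_doy - 1)

# Month number of the converted date
toy$discovery_month <- as.integer(format(toy$discovery_date, "%m"))

print(toy)
```

Day 60 lands on 1 March in 2015 but on 29 February in 2016, which is why the origin has to be rebuilt for each fire year.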

2. Creating some initial visualisations

Fires per year

year_plot <- fires_small %>%
  group_by(fire_year) %>%
  summarise(num_fires = n())
`summarise()` ungrouping output (override with `.groups` argument)
year_plot %>%
  ggplot +
  aes(x = fire_year, y = num_fires) +
  geom_point() +
  ylim(0, 120000)

  # geom_col(fill = "dark blue", col ="white") +
  # geom_smooth(method = "lm", se = FALSE, colour = "red")

There is a lot of variation between years. Visually, a repeating pattern appears to occur roughly every 5 years, with 4 peaks visible within the reporting period. Having looked at the historic weather for that date range, some of these peaks seem to coincide with recorded heatwaves in 2000, 2006 and 2011. (1)

  1. https://en.wikipedia.org/wiki/List_of_heat_waves

I will try a linear model to see whether the number of fires is increasing or decreasing over time. As this is time series data a straight line is unlikely to fit well, but it could still be interesting to see whether it picks out a general underlying trend.

model <- lm(formula = num_fires ~ fire_year, data = year_plot)
summary(model)

Call:
lm(formula = num_fires ~ fire_year, data = year_plot)

Residuals:
   Min     1Q Median     3Q    Max 
-16835  -8688  -2049   9226  34793 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -609543.8   756667.9  -0.806    0.429
fire_year       343.3      377.7   0.909    0.373

Residual standard error: 12810 on 22 degrees of freedom
Multiple R-squared:  0.03621,   Adjusted R-squared:  -0.007601 
F-statistic: 0.8265 on 1 and 22 DF,  p-value: 0.3731
tidy(model)
clean_names(glance(model))

The R-squared of 0.036 is very low, as expected from the widely scattered plot, and with a p-value of 0.373 the slope is not significantly different from zero, so the model is not a good fit.

autoplot(model)
`arrange_()` is deprecated as of dplyr 0.7.0.
Please use `arrange()` instead.
See vignette('programming') for more help
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.

The diagnostic plots agree that the model isn’t a great fit and there is likely to be a curve in the best fit line. As this is time series data we already know this is the case.

# Assign the result so the predictions and residuals are kept
year_plot <- year_plot %>%
  add_predictions(model) %>%
  add_residuals(model)
year_plot
year_plot %>%
  ggplot(aes(x = fire_year)) +
  geom_point(aes(y = num_fires)) +
  geom_abline(
    intercept = model$coefficients[1],
    slope = model$coefficients[2],
    col = "red"
  ) +
  ylim(0, 120000)

The plotted best-fit line does show a slight increase, but as the p-value is far too high I cannot accept this model as a true representation of the underlying trend.

Just to make sure, I’m going to bootstrap the slope of the same model.

bootstrap_distribution_slope <- year_plot %>%
  specify(formula = num_fires ~ fire_year) %>%
  generate(reps = 10000, type = "bootstrap") %>%
  calculate(stat = "slope")

slope_ci95 <- bootstrap_distribution_slope %>%
  get_ci(level = 0.95, type = "percentile")
slope_ci95
bootstrap_distribution_slope %>%
  visualise(bins = 30) +
  shade_ci(endpoints = slope_ci95)

clean_names(tidy(model, conf.int = TRUE, conf.level = 0.95))

As 0 lies within the 95% confidence interval of -283 to +1075, this reinforces the conclusion that the model cannot tell us whether there is a positive or negative trend in the data. A model designed for time series with seasonal variation will be more useful. That needs more data points, so I will now use monthly rather than yearly data.
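
To show what a percentile bootstrap of the slope does under the hood, here is a rough base R sketch on simulated yearly counts. The simulated data, the `boot_slope()` helper and the 2,000 replicates are all assumptions for illustration; this is not the infer implementation and it does not reproduce the interval quoted above.

```r
set.seed(42)

# Simulated yearly counts (invented numbers, not the fires data)
toy <- data.frame(fire_year = 1992:2015)
toy$num_fires <- 70000 + 300 * (toy$fire_year - 1992) +
  rnorm(nrow(toy), sd = 12000)

# Hypothetical helper: resample rows with replacement and refit the slope
boot_slope <- function(d) {
  idx <- sample(nrow(d), replace = TRUE)
  coef(lm(num_fires ~ fire_year, data = d[idx, ]))[["fire_year"]]
}

# Bootstrap distribution of the slope
slopes <- replicate(2000, boot_slope(toy))

# Percentile 95% confidence interval
ci <- quantile(slopes, c(0.025, 0.975))
print(ci)

# Does the interval contain zero?
print(ci[[1]] <= 0 && 0 <= ci[[2]])
```

If the interval contains zero, the resampled data are consistent with both an upward and a downward trend, which is the same reading applied to the infer result.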

Fires per month

fires_small %>%
  mutate(year_month = make_date(fire_year, discovery_moy)) %>%
  group_by(year_month) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = year_month, y = num_fires) +
  geom_line(col = "dark blue")
`summarise()` ungrouping output (override with `.groups` argument)

Peaks are still shown to be occurring in the summers. The 2006 heatwave is especially visible.

Using the SARIMA model.

monthly <- fires_small %>%
  mutate(year_month = make_date(fire_year, discovery_moy)) %>%
  group_by(year_month) %>%
  summarise(num_fires = n())
`summarise()` ungrouping output (override with `.groups` argument)
write_csv(monthly, "clean_data/monthly.csv")

monthly

Continued on separate worksheet

Fires per day

fires_small %>%
  group_by(discovery_date) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = discovery_date, y = num_fires) +
  geom_line(col = "dark blue") 
`summarise()` ungrouping output (override with `.groups` argument)

This shows a typical time series plot with a cyclic variation due to warmer weather in the summer time.

Fires by day of year

fires_small %>%
  group_by(discovery_doy) %>%
  summarise(num_fires = n()) %>%
  ggplot(aes(x = discovery_doy, y = num_fires)) +
  geom_line(col = "dark blue")
`summarise()` ungrouping output (override with `.groups` argument)

There are peaks around days 60-110 and a big peak around day 180.

Checking the data to see where the peak occurs

fires_small %>%
  group_by(discovery_doy) %>%
  summarise(num_fires = n()) %>%
  arrange(desc(num_fires))
`summarise()` ungrouping output (override with `.groups` argument)

The 2 busiest days of the year are days 185 and 186, which are Independence Day (4th July) in a normal year and a leap year respectively. So I imagine most of the extra fires (over double the normal number) are caused by fireworks.
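
The day-of-year figures are easy to verify in base R with the `%j` format code. The two years below are arbitrary examples of a normal year and a leap year.

```r
# 4 July in a normal year (2015) and a leap year (2016)
july4 <- as.Date(c("2015-07-04", "2016-07-04"))

# "%j" gives the day of the year as a zero-padded string
doy <- as.integer(format(july4, "%j"))
print(doy)
```

This gives 185 for the normal year and 186 for the leap year, matching the two peak days in the table.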

Fires by month of year

fires_small %>%
  group_by(discovery_moy) %>%
  summarise(num_fires = n()) %>%
  ggplot(aes(x = discovery_moy, y = num_fires)) +
  geom_col(fill = "dark blue", col = "white")
`summarise()` ungrouping output (override with `.groups` argument)

There are 2 definite peaks during the year. The March and April peak is possibly due to the US “Spring Break”, when schools and universities are closed and families are likely to be on vacation, perhaps visiting National Parks. July and August are the summer break for schools, so families visiting parks and hot weather are both likely causes of fire outbreaks.

Fires by cause

options(scipen = 999)

fires_small %>%
  group_by(stat_cause_descr) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(reorder(x = stat_cause_descr, num_fires), y = num_fires) +
  geom_col(fill = "dark blue") + 
  coord_flip() 
`summarise()` ungrouping output (override with `.groups` argument)

Fire avg size by cause

fires_small %>%
  group_by(stat_cause_descr) %>%
  summarise(avg_size = mean(fire_size)) %>%
  ggplot +
  aes(reorder(x = stat_cause_descr, avg_size), y = avg_size) +
  geom_col(fill = "dark blue") + 
  coord_flip()
`summarise()` ungrouping output (override with `.groups` argument)

Avg burn time by cause

fires_small %>%
  summarise(num_na = sum(is.na(cont_date)))

Roughly half of the records have no containment date, making it very difficult to do any meaningful analysis of burn time.
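
A proportion is easier to judge than a raw count of NAs. The sketch below uses invented values (the `toy` data frame is an assumption, not the fires data) to show the overall share of missing containment dates and the share per cause:

```r
# Invented rows; column names mirror fires_small
toy <- data.frame(
  stat_cause_descr = c("Arson", "Arson", "Lightning",
                       "Lightning", "Lightning", "Campfire"),
  cont_date        = c(2450000.5, NA, 2450010.5, NA, NA, 2450020.5)
)

# Overall share of rows with no containment date
prop_missing <- mean(is.na(toy$cont_date))
print(prop_missing)

# Share of missing containment dates within each cause
missing_by_cause <- tapply(is.na(toy$cont_date), toy$stat_cause_descr, mean)
print(missing_by_cause)
```

If the missing values turned out to be concentrated in a few causes, a burn-time comparison between the remaining causes might still be possible.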

Fires by size

fires_small %>%
  group_by(fire_size_class) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = fire_size_class, y = num_fires, fill = fire_size_class) +
  geom_col() +
  scale_fill_manual(values = c("red", "orange", "yellow", "green", "blue", 
                               "purple", "black"),
                    name = "Fire Size Classification",
                    breaks = c("A", "B", "C", "D", "E", "F", "G"),
                    labels = c("A: < 1/4 acre", "B: 1/4 to 10 acres", "C: 10 to 100 acres",
                               "D: 100 to 300 acres", "E: 300 to 1000 acres",
                               "F: 1000 to 5000 acres", "G: More than 5000 acres"))
`summarise()` ungrouping output (override with `.groups` argument)

Geo Spatial wrangling

To make it easier to compare the frequency of wildfires between states I want to display it in a map format. As I’m already using ggplot2 I’m going to use it for the maps too, with geom_polygon() and coord_map() along with theme_map() from ggthemes.

I’m not entirely sure what geo-spatial information is held within the sqlite database file; I’ve made a few attempts to retrieve it but have been unsuccessful. Therefore I’m going to use map_data() from ggplot2, which takes the state boundary coordinates from the maps package, together with the datasets package, which includes various bits of information on the US States.

# State boundary co-ordinates from the 'maps' package, via ggplot2::map_data()

state_map <- map_data("state")
state_map

Annoyingly it doesn’t have the abbreviation of each State, only the full name, so I need to add that in. Luckily the ‘datasets’ package has vectors of State names and abbreviations, so I shall make a tibble containing both.

state.abb
 [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA"
[16] "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ"
[31] "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT"
[46] "VA" "WA" "WV" "WI" "WY"
state.name
 [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
 [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
 [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
[13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
[17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
[21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
[25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
[29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
[33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
[37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
[41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
[45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
[49] "Wisconsin"      "Wyoming"       
state_list <- tibble(state = state.abb, state_name = state.name)
state_list

The state names in the state_map dataframe are in lower case, in a column called ‘region’. I shall change the state_list tibble to the same format so they can be joined together.

state_list <- tibble(state = state.abb, region = tolower(state.name))

Joining state_list to the fires_small dataset

fires_states <- fires_small %>%
  left_join(state_list, by = "state")

fires_states

Checking the join has worked and there are no missing values.

fires_states %>%
  filter(is.na(region))

There are 22,147 NAs in the ‘region’ column we just made. Scrolling through, the 2 State codes missing from the state_list tibble are ‘PR’ and ‘DC’.
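
Rather than scrolling, the unmatched codes can be listed directly by comparing the codes in the data against the lookup vector. A minimal base R sketch, using an invented `toy_fires` data frame in place of the real fires data:

```r
# Invented state codes for illustration
toy_fires <- data.frame(state = c("CA", "TX", "DC", "PR", "CA", "GA"))

# Codes present in the data but absent from the 50-state lookup
unmatched <- setdiff(unique(toy_fires$state), datasets::state.abb)
print(unmatched)

# How many rows each unmatched code accounts for
print(table(toy_fires$state[toy_fires$state %in% unmatched]))
```

On the toy data this returns DC and PR, the same two codes found by scrolling.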

After some quick research it turns out there are only 50 States in the US. Washington DC is technically not a State but a Federal District, as it is the seat of government, which is why it wasn’t included in the original vectors. PR is Puerto Rico, which is also not a State but the largest US territory.

I shall add DC and PR into the state_list and re-join it.

# Adding DC and PR (not States, but present in the fires data)

state.abb <- append(state.abb, c("DC", "PR"))
state.name <- append(state.name, c("District of Columbia", "Puerto Rico"))

state_list <- tibble(state = state.abb, region = tolower(state.name))
# Re-joining tibbles

fires_states <- fires_small %>%
  left_join(state_list, by = "state")
# Checking the join has worked properly and there are no NAs

fires_states %>%
  filter(is.na(region))
Warning in `[<-.data.frame`(`*tmp*`, is_list, value = list(`23` = "<S3: blob>")) :
  replacement element 1 has 1 row to replace 0 rows
# Code below brings up a "vector memory exhausted (limit reached?)" error

# fires_joined <- fires_states %>%
#  right_join(state_map, by = "region")

The full data set and the geo information are too big to join, so I’m going to summarise first to get the number of fires per region and then join.

fires_joined <- fires_states %>% 
    select(region) %>%
    group_by(region) %>%
    summarise(num_fires = n()) %>%
    right_join(state_map, by = "region")
`summarise()` ungrouping output (override with `.groups` argument)
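
The reason aggregating first helps is that every fire row matches every boundary point of its state, so the raw join multiplies the two row counts together. A small base R sketch on invented data (`fires_toy` and `map_toy` are assumptions for illustration) shows the difference in size:

```r
set.seed(1)

# 1,000 invented fires spread over two regions
fires_toy <- data.frame(
  region = sample(c("texas", "georgia"), 1000, replace = TRUE)
)

# 50 invented boundary points per region
map_toy <- data.frame(
  region = rep(c("texas", "georgia"), each = 50),
  long   = runif(100),
  lat    = runif(100)
)

# Raw join: each fire is repeated once per boundary point of its region
raw_join <- merge(fires_toy, map_toy, by = "region")

# Aggregate first, then join: one count per region attached to each point
counts <- as.data.frame(table(region = fires_toy$region),
                        responseName = "num_fires",
                        stringsAsFactors = FALSE)
agg_join <- merge(map_toy, counts, by = "region", all.x = TRUE)

print(c(raw = nrow(raw_join), aggregated = nrow(agg_join)))
```

The aggregated join has exactly as many rows as the map data, whereas the raw join grows with the number of fires.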

Result!! Now doing first geo spatial visualisation

Total Wildfires per state from 1992-2015

fires_joined %>% 
    ggplot +
    (aes(x = long, y = lat, group = group, fill = num_fires)) + 
    geom_polygon() + 
    geom_path(color = "white") + 
    scale_fill_continuous(low = "darkblue", 
                          high = "darkred",
                          name = "Number of fires") + 
    theme_map() + 
    coord_map("mollweide") + 
    ggtitle("Total US Wildfires from 1992-2015") + 
    theme(plot.title = element_text(hjust = 0.5))

3. Geo Spatial Visualisations

The dataset has a cause of fire column so I shall now create some causation plots.

Getting list of fire causes

fires_states %>%
  distinct(stat_cause_descr) %>%
  arrange(stat_cause_descr)

Total fire by cause in tabular form

fires_states %>%
  select(stat_cause_descr) %>%
  group_by(stat_cause_descr) %>%
  summarise(num_fires = n()) %>%
  arrange(desc(num_fires))
`summarise()` ungrouping output (override with `.groups` argument)
NA

Number of fires by state in tabular form

fires_states %>%
  select(region) %>%
  group_by(region) %>%
  summarise(num_fires = n()) %>%
  arrange(desc(num_fires))
`summarise()` ungrouping output (override with `.groups` argument)

As the cause needs to be filtered before the map join, I’m either going to have to repeat a whole load of the same code in every single plot or write a function that will do it for me, saving a lot of typing!

# Function for plotting cause of fire

cause <- function(cause) {
  fires_states %>%
    filter(stat_cause_descr == cause) %>%
    select(region) %>%
    group_by(region) %>%
    summarise(num_fires = n()) %>%
    right_join(state_map, by = "region") %>%
    ggplot +
    (aes(x = long, y = lat, group = group, fill = num_fires)) + 
    geom_polygon() + 
    geom_path(color = "white") + 
    scale_fill_continuous(low = "darkblue", 
                          high = "darkred",
                          name = "Number of fires") + 
    theme_map() + 
    coord_map("mollweide") + 
    ggtitle(paste0("Total US Wildfires caused by ", cause, " from 1992-2015")) + 
    theme(plot.title = element_text(hjust = 0.5))
}

Wildfires caused by Arson

cause("Arson")
`summarise()` ungrouping output (override with `.groups` argument)

Arson does seem more prevalent in the SE states of Mississippi, Georgia, Alabama and also the western state of California.

Wildfires caused by Campfire

cause("Campfire")
`summarise()` ungrouping output (override with `.groups` argument)

Campfires are the most prevalent in the Western states of Oregon, California and Arizona.

Wildfires caused by Children

cause("Children")
`summarise()` ungrouping output (override with `.groups` argument)

Fires caused by children are spread around the country, but the most affected states are California in the West, and Alabama, South Carolina and New Jersey in the East.

Wildfires caused by Debris Burning

cause("Debris Burning")
`summarise()` ungrouping output (override with `.groups` argument)

Fires by burning debris are mostly in the southern warmer states of Texas, Georgia and North Carolina.

Wildfires caused by Equipment Use

cause("Equipment Use")
`summarise()` ungrouping output (override with `.groups` argument)

Most of the fires caused by equipment seem to be in California

Wildfires caused by Fireworks

cause("Fireworks")
`summarise()` ungrouping output (override with `.groups` argument)

Most of the fires caused by fireworks seem to be in the north of the country, primarily South Dakota, Montana and Washington state.

Wildfires caused by Lightning

cause("Lightning")
`summarise()` ungrouping output (override with `.groups` argument)

Apart from a hotspot in Florida, the vast majority of fires caused by lightning are in the West of the country, with the 3 most affected states being California, Oregon and Arizona.

Wildfires caused by Miscellaneous

cause("Miscellaneous")
`summarise()` ungrouping output (override with `.groups` argument)

There seems to be quite a few miscellaneous classifications in California, Texas and New York.

Wildfires caused by Missing/Undefined

cause("Missing/Undefined")
`summarise()` ungrouping output (override with `.groups` argument)

The states with the most missing or undefined data are North and South Carolina, Oklahoma and California.

Wildfires caused by Powerline

cause("Powerline")
`summarise()` ungrouping output (override with `.groups` argument)

Texas has the largest number of wildfires caused by powerlines. This is likely due to the warm climate and the large proportion of the state that is dry grassland used for agriculture. (1)

  1. https://uk.reuters.com/article/us-wildfires-texas/trees-and-power-lines-caused-major-texas-fire-idUSTRE78J76A20110920

Wildfires caused by Railroad

cause("Railroad")
`summarise()` ungrouping output (override with `.groups` argument)

By far Florida has the most wildfires caused by railroads.

Wildfires caused by Smoking

cause("Smoking")
`summarise()` ungrouping output (override with `.groups` argument)

Fires caused by smoking seem to be spread around the country, but mainly on the east and west coasts.

Wildfires caused by Structure

cause("Structure")
`summarise()` ungrouping output (override with `.groups` argument)

South Dakota has the largest proportion of fires caused by structures.

Unsurprisingly the southern states seem to have more occurrences of wildfire in general, no doubt due to the warmer climate at their latitudes. Also, the states with the highest and 3rd highest number of fires are among the largest States by area. However the 2nd highest State is Georgia, which although it is in the South of the country is only an average sized State. Therefore, to get a better picture of what is going on, I’m going to look at the number of fires per square mile, normalising by State area.

The datasets package also includes the area of each state in square miles, in the state.area vector.

state.area
 [1]  51609 589757 113909  53104 158693 104247   5009   2057  58560  58876
[11]   6450  83557  56400  36291  56290  82264  40395  48523  33215  10577
[21]   8257  58216  84068  47716  69686 147138  77227 110540   9304   7836
[31] 121666  49576  52586  70665  41222  69919  96981  45333   1214  31055
[41]  77047  42244 267339  84916   9609  40815  68192  24181  56154  97914
length(state.area)
[1] 50

Annoyingly it also only has 50 entries, not 52, so I will need to add DC and PR back in.

(Area figures obtained from Wikipedia)

DC = 68 miles^2
PR = 3515 miles^2

# To make my life easier I'm going to rebuild the tibble from the original 'datasets' vectors,
# adding DC, PR and the land area figures at the same time so everything stays in the same order.

state_list <- tibble(
  state  = c(datasets::state.abb, "DC", "PR"),
  region = tolower(c(datasets::state.name, "District of Columbia", "Puerto Rico")),
  area   = c(datasets::state.area, 68, 3515)  # square miles
)
# Re-joining tibbles

fires_states <- fires_small %>%
  left_join(state_list, by = "state")

Normalising States area sizes

fires_states %>%
  select(region, area) %>%
  group_by(region, area) %>%
  summarise(num_fires = n()) %>%
  mutate(fires_sqmile = num_fires / area) %>%
  arrange(desc(fires_sqmile))
`summarise()` regrouping output by 'region' (override with `.groups` argument)

This table shows Puerto Rico has the highest proportion of fires compared to its size, followed by New Jersey in the NE of the country and finally by the States in the SE of the country.
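
To make the effect of the normalisation concrete, here is a small base R sketch that divides some invented fire counts by the real `state.area` figures from the datasets package. The counts in `toy_counts` are made up for illustration and are not results from the fires data.

```r
# Real areas (square miles) from the datasets package
areas <- data.frame(state = datasets::state.abb,
                    area  = datasets::state.area)

# Invented fire counts for three states
toy_counts <- data.frame(state     = c("TX", "GA", "NJ"),
                         num_fires = c(1000, 800, 200))

# Attach the area and compute fires per square mile
rates <- merge(toy_counts, areas, by = "state")
rates$fires_sqmile <- rates$num_fires / rates$area

# Rank by density rather than raw count
ranked <- rates[order(-rates$fires_sqmile), ]
print(ranked)
```

With these toy numbers the smallest state ranks first by density despite having the fewest fires, which is the kind of reordering seen for New Jersey in the table above.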

fires_states %>%
  select(region, area) %>%
  group_by(region, area) %>%
  summarise(num_fires = n()) %>%
  mutate(fires_sqmile = num_fires / area) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = fires_sqmile)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  scale_fill_distiller(name = "Fire per Sq Mile", palette = "PuBuGn") +
  theme_map() + 
  coord_map("mollweide") + 
  ggtitle(paste0("Total US Wildfires per Square Mile from 1992-2015")) + 
  theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)

Puerto Rico is not shown on this map (nor are Alaska and Hawaii, as the state map data only covers the contiguous states and DC), but visually we can see the data for the remaining entries, and the south eastern states still have the highest proportion of wildfires. Interestingly, New Jersey also shows up as a hotspot in the NE of the country.

Do causes change over time?

Splitting causes into 2 groups for legibility.

The first group is for fires caused directly by people.

fires_states %>%
  select(stat_cause_descr, fire_year) %>%
  group_by(fire_year, stat_cause_descr) %>%
  filter(stat_cause_descr %in% c("Arson", "Campfire", "Children",
                                 "Equipment Use", "Fireworks", "Smoking")) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = fire_year, y = num_fires, colour = stat_cause_descr) +
  geom_line()
`summarise()` regrouping output by 'fire_year' (override with `.groups` argument)

The 2 large peaks in Arson in 1999 and 2006 are obvious. There was a large heatwave in 2006, but I’m not sure why this would result in an increase in arson, unless the dry ground provided extra fuel and helped fires spread that would not normally have resulted in a large scale fire. This may also be the reason for the 2006 peak in Equipment Use. Arson does, however, look to have been decreasing since 2006.

And this one is for the remaining causes, including naturally occurring fires.

fires_states %>%
  select(stat_cause_descr, fire_year) %>%
  group_by(fire_year, stat_cause_descr) %>%
  filter(stat_cause_descr %in% c("Debris Burning", "Lightning", "Miscellaneous",
                                 "Missing/Undefined", "Powerline", "Railroad",
                                 "Structure")) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = fire_year, y = num_fires, colour = stat_cause_descr) +
  geom_line()
`summarise()` regrouping output by 'fire_year' (override with `.groups` argument)

Similar peaks can be seen in Debris Burning, Miscellaneous and Lightning during the 2006 heatwave, which left the ground very dry. There are also peaks from 1997 to 2003 in Debris Burning, Miscellaneous and Lightning, alongside a trough in Missing/Undefined, so this is likely to be due to more accurate classification of fires and less use of the Missing/Undefined category.

Difference in causes between states

state_map_southern <- state_map %>%
  filter(region %in% c("florida", "georgia", "alabama", "mississippi",
                       "south carolina", "north carolina", "tennessee",
                       "arkansas", "louisiana"))
fires_states %>%
  filter(fire_year %in% 1992:1995) %>%
  filter(region %in% state_map_southern$region) %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 1992-1995") + 
  theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
  filter(fire_year %in% 1996:1999) %>%
  filter(region %in% state_map_southern$region) %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 1996-1999") + 
  theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
  filter(fire_year %in% 2000:2003) %>%
  filter(region %in% state_map_southern$region) %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 2000-2003") + 
  theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
  filter(fire_year %in% 2004:2007) %>%
  filter(region %in% state_map_southern$region) %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 2004-2007") + 
  theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
  filter(fire_year %in% 2008:2011) %>%
  filter(region %in% state_map_southern$region) %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 2008-2011") + 
  theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
  filter(fire_year %in% 2012:2015) %>%
  filter(region %in% state_map_southern$region) %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 2012-2015") + 
  theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

Looking at these trends, some interesting insights can be seen. In the combined data Florida stands out as having railroad as its main cause of wildfire, but the plots above show that railroad is only the main cause up to the 4-year period ending in 2003; after that the main cause changes to lightning until the end of the collection period in 2015. Similarly, arson is a common leading cause in the southern states until 2007, after which it no longer appears as the most common cause of wildfire. This downward trend was also noted earlier in the overall causation plots for all states.
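
The six period plots above share everything except the year range, so the "most common cause per region" step could be wrapped in a helper and applied to each period. The base R sketch below is one possible way to do that; `top_cause_by_region()` and the `toy` data are hypothetical, and unlike `top_n(1)` it keeps only the first cause when two causes tie.

```r
# Hypothetical helper: most common cause per region within a set of years
top_cause_by_region <- function(d, years) {
  d <- d[d$fire_year %in% years, ]
  # Count fires for every region/cause combination
  counts <- as.data.frame(
    table(region = d$region, cause = d$stat_cause_descr),
    stringsAsFactors = FALSE
  )
  # Keep the row with the highest count in each region (first one on ties)
  top <- lapply(split(counts, counts$region),
                function(g) g[which.max(g$Freq), ])
  do.call(rbind, top)
}

# Invented rows for illustration; column names mirror fires_states
toy <- data.frame(
  fire_year        = c(1992, 1993, 1994, 1996, 1997, 1997),
  region           = rep("florida", 6),
  stat_cause_descr = c("Railroad", "Railroad", "Lightning",
                       "Lightning", "Lightning", "Arson")
)

# Six 4-year periods covering 1992-2015
periods <- split(1992:2015, rep(1:6, each = 4))

# Apply the helper to the first two periods of the toy data
res <- lapply(periods[1:2], top_cause_by_region, d = toy)
print(res)
```

Each element of `res` could then be joined to the map data and plotted as before, keeping the plotting code in one place.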

Correlation between states and fire size

fires_states %>%
  select(region, fire_size_class) %>%
  group_by(region, fire_size_class) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = fire_size_class)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Fire Size Class", palette = "PuBuGn") +
  ggtitle("Most common wildfire size per State 1992-2015") +
  theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
  select(region, fire_size_class) %>%
  filter(fire_size_class == "G") %>%
  group_by(region) %>%
  summarise(num_fire = n()) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = num_fire)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_distiller(name = "Number of Fires", palette = "PuBuGn") +
  ggtitle("Number of large class G fires per State 1992-2015") +
  theme(plot.title = element_text(hjust = 0.5))
`summarise()` ungrouping output (override with `.groups` argument)

From the plots we can see that the Western states have the most small fires and also the most large fires! Not entirely the most helpful plots…

Are fires more prevalent in certain months for individual states?

fires_states %>%
  filter(fire_year == "1992" | fire_year == "1993" | fire_year == "1994" |
           fire_year == "1995") %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 1992-1995") +
  theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
  filter(fire_year == "1996" | fire_year == "1997" | fire_year == "1998" |
           fire_year == "1999") %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 1996-1999") +
  theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
  filter(fire_year == "2000" | fire_year == "2001" | fire_year == "2002" |
           fire_year == "2003") %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 2000-2003") +
  theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
  filter(fire_year == "2004" | fire_year == "2005" | fire_year == "2006" |
           fire_year == "2007") %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 2004-2007") +
  theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
  filter(fire_year == "2008" | fire_year == "2009" | fire_year == "2010" |
           fire_year == "2011") %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 2008-2011") +
  theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
  filter(fire_year == "2012" | fire_year == "2013" | fire_year == "2014" |
           fire_year == "2015") %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 2012-2015") +
  theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

fires_states %>%
  filter(fire_year == "2010" | fire_year == "2011" | fire_year == "2012" |
           fire_year == "2013" | fire_year == "2014" | fire_year == "2015") %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 2010-2015") +
  theme(plot.title = element_text(hjust = 0.5))
`summarise()` regrouping output by 'region' (override with `.groups` argument)
Selecting by num_fire

The above plots are quite interesting. The month with the most fires varies widely between states. Broadly, the eastern half of the country has the most fires in Spring (Feb-May) and the western part of the country has the most fires later on, in Summer and Fall (Jun-Oct). There are however a few exceptions: in the 2004-2007 and 2008-2011 data Texas has the most fires in January. Florida also mostly conforms to the East/West split, with its worst month for fires being March or April up until 2007; after that the most common month moves later, into June and July, for the rest of the reporting period until 2015. This may have to do with the main cause of fires in Florida changing from railroad to lightning at about the same time, as we noted earlier when looking at causation. As July is the main month for tropical storms and lightning in Florida, this is a possible reason for the peak month moving later in the year. (1)

  1. https://www.weather.gov/mlb/fl_lightning_climo
---
title: "R Notebook"
output: html_notebook
---

```{r}
# MASS is loaded first so that MASS::select() does not mask dplyr::select()
library(MASS)
library(tidyverse)
library(RSQLite)
library(dbplyr)
library(janitor)
library(lubridate)
library(datasets)
library(ggthemes)
library(gganimate)
library(modelr)
library(broom)
library(ggfortify)
library(infer)
library(tseries)
library(forecast)
library(fable)
library(fabletools)
library(tsibble)
library(tsibbledata)
library(feasts)
```


# 1.  Data Cleaning


####  Creating connection to the sqlite database and downloading fires dataset

```{r}
# Connecting

conn <- dbConnect(SQLite(), "raw_data/FPA_FOD_20170508.sqlite")
```

```{r}
# Pulling all the names of the tables in the database file

as.data.frame(dbListTables(conn))
```

```{r}
# Making fires dataframe

fires <- tbl(conn, "Fires") %>% collect()
```


#### Seeing what other useful information is in the database.  The majority are part of the database structure and are not readable in R.

```{r}
# EPSG worldwide geodetic parameter dataset system
spatial_ref <- tbl(conn, "spatial_ref_sys_all") %>% collect()

# National Wildfire Coordinating Group unit abbreviations 
NWGG <- tbl(conn, "NWCG_UnitIDActive_20170109") %>% collect()
```


```{r}
# Disconnect

dbDisconnect(conn)
```


### Selecting columns of interest

```{r}
fires_small <- fires %>%
  select(NWCG_REPORTING_AGENCY, SOURCE_REPORTING_UNIT_NAME, FIRE_NAME,
         FIRE_YEAR, DISCOVERY_DATE, DISCOVERY_DOY, DISCOVERY_TIME, CONT_DATE,
         CONT_DOY, CONT_TIME, STAT_CAUSE_CODE, STAT_CAUSE_DESCR, FIRE_SIZE, 
         FIRE_SIZE_CLASS, LATITUDE, LONGITUDE, OWNER_CODE, OWNER_DESCR, STATE, 
         COUNTY, FIPS_CODE, FIPS_NAME, Shape)

fires_small <- clean_names(fires_small)
```


### Changing some columns to be factors

```{r}
fires_small <- fires_small %>%
  mutate(nwcg_reporting_agency = as.factor(nwcg_reporting_agency)) %>%
  mutate(stat_cause_code = as.factor(stat_cause_code)) %>%
  mutate(fire_size_class = as.factor(fire_size_class)) %>%
  mutate(owner_descr = as.factor(owner_descr)) %>%
  mutate(state = as.factor(state)) 
```


### Date is in Julian format, so overwriting with Gregorian format using year and day of year columns.  Also adding in a 'month of year column' for future use.

```{r}
fires_small <- fires_small %>%
  mutate(date_origin = as.Date(paste0(fire_year, "-01-01"))) %>%
  # day of year 1 is 1 January itself, so the offset from the origin is doy - 1
  mutate(discovery_date = as.Date(discovery_doy - 1, origin = date_origin)) %>%
  mutate(discovery_moy = month(discovery_date, label = TRUE)) %>%
  select(-date_origin)
```
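
A quick way to check a day-of-year conversion is to confirm that day 1 lands on 1 January. With `as.Date(x, origin = ...)` the origin counts as day 0, so the offset has to be `doy - 1`. The sketch below uses a hypothetical helper, `doy_to_date()`, which is not part of the notebook and only illustrates the arithmetic.

```{r}
# Hypothetical helper (for illustration only): year + day of year -> Date
doy_to_date <- function(year, doy) {
  # Day 1 is 1 January itself, so the offset from the origin is doy - 1
  as.Date(doy - 1, origin = as.Date(paste0(year, "-01-01")))
}

doy_to_date(2015, 1)    # 1 January 2015
doy_to_date(2015, 185)  # 4 July 2015
doy_to_date(2016, 186)  # 4 July 2016 (leap year)
```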



# 2. Creating some initial visualisations


### Fires per year

```{r}
year_plot <- fires_small %>%
  group_by(fire_year) %>%
  summarise(num_fires = n())

year_plot %>%
  ggplot +
  aes(x = fire_year, y = num_fires) +
  geom_point() +
  ylim(0, 120000)

```
**There is a lot of variation in the data between years.  Visually it looks like a repeating pattern is occurring every 5 years or so, with 4 peaks visible within this reporting period.  Having looked at the historic weather for that date range, these peaks seem to coincide with recorded heatwaves in 2000, 2006 and 2011. (1)**

(1) https://en.wikipedia.org/wiki/List_of_heat_waves


#### I will try to create a linear model to hopefully show any increase or decrease in fires over time.  As this is time series data a linear model is unlikely to fit well, but it could be interesting to see if it identifies a general underlying trend.


```{r}
# creating the linear model

model <- lm(formula = num_fires ~ fire_year, data = year_plot)
summary(model)
```
```{r}
tidy(model)
```

```{r}
clean_names(glance(model))
```

**The R squared is quite low, as expected from the widely spread plot, and the high p-value already tells us the model is not a good fit.**


```{r}
# diagnostic plots

autoplot(model)
```

**The diagnostic plots agree that the model isn't a great fit and there is likely to be a curve in the best fit line.  As this is time series data we already know this is the case.**


```{r}
# Assign the result so the prediction and residual columns are actually kept
year_plot <- year_plot %>%
  add_predictions(model) %>%
  add_residuals(model)
year_plot
```

```{r}
# plotting the model best fit line

year_plot %>%
  ggplot(aes(x = fire_year)) +
  geom_point(aes(y = num_fires)) +
  geom_abline(
    intercept = model$coefficients[1],
    slope = model$coefficients[2],
    col = "red"
  ) +
  ylim(0, 120000)
  
```

**The plotted best fit line does show a slight increase, but as the p-value is far too high I cannot accept this model as a true representation of the underlying trend.**


### Just to make sure I'm going to use bootstrapping using the same model


```{r}
bootstrap_distribution_slope <- year_plot %>%
  specify(formula = num_fires ~ fire_year) %>%
  generate(reps = 10000, type = "bootstrap") %>%
  calculate(stat = "slope")

slope_ci95 <- bootstrap_distribution_slope %>%
  get_ci(level = 0.95, type = "percentile")
slope_ci95
```

```{r}
bootstrap_distribution_slope %>%
  visualise(bins = 30) +
  shade_ci(endpoints = slope_ci95)
```

```{r}
clean_names(tidy(model, conf.int = TRUE, conf.level = 0.95))
```

**As 0 lies within the 95% confidence interval of -283 to +1075, this reinforces the fact that the model cannot be used to say whether there is any positive or negative trend in this data.  It will be more useful to use a model that is designed for time series and seasonal variation.  For that I will need more data points, so I will now use monthly data rather than yearly.**
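
To make the percentile bootstrap less of a black box, here is a minimal base-R sketch of the same idea: resample the rows with replacement, refit the slope each time, and take the 2.5% and 97.5% quantiles of the resampled slopes. The `toy` data frame is simulated and is NOT the wildfire data, and the helper `boot_slope()` is hypothetical; `infer` may differ in its internal details.

```{r}
set.seed(42)

# Toy yearly counts (simulated for illustration, not the wildfire data)
toy <- data.frame(year = 1992:2015)
toy$n <- 70000 + 400 * (toy$year - 1992) + rnorm(nrow(toy), sd = 10000)

# One bootstrap replicate: resample rows with replacement and refit the slope
boot_slope <- function(d) {
  idx <- sample(nrow(d), replace = TRUE)
  coef(lm(n ~ year, data = d[idx, ]))[["year"]]
}

slopes <- replicate(2000, boot_slope(toy))

# Percentile 95% interval: if it contains 0, the trend cannot be
# distinguished from a flat line at this confidence level
ci <- quantile(slopes, probs = c(0.025, 0.975))
ci
```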



### Fires per month

```{r}
fires_small %>%
  mutate(year_month = make_date(fire_year, discovery_moy)) %>%
  group_by(year_month) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = year_month, y = num_fires) +
  geom_line(col = "dark blue")
```

**Peaks are still shown to be occurring in the summers. The 2006 heatwave is especially visible.**



### Using the SARIMA model.

```{r}
monthly <- fires_small %>%
  mutate(year_month = make_date(fire_year, discovery_moy)) %>%
  group_by(year_month) %>%
  summarise(num_fires = n())

# The file path is passed positionally: the `path` argument name is deprecated in newer readr versions
write_csv(monthly, "clean_data/monthly.csv")

monthly
```

**Continued on separate worksheet**




### Fires per day

```{r}
fires_small %>%
  group_by(discovery_date) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = discovery_date, y = num_fires) +
  geom_line(col = "dark blue") 

```

**This shows a typical time series plot with a cyclic variation due to warmer weather in the summer time.**





### Fires by day of year


```{r}
fires_small %>%
  group_by(discovery_doy) %>%
  summarise(num_fires = n()) %>%
  ggplot(aes(x = discovery_doy, y = num_fires)) +
  geom_line(col = "dark blue")
```


**There are peaks around days 60-110 and a big peak around day 180.**

#### Checking the data to see where the peak occurs

```{r}
fires_small %>%
  group_by(discovery_doy) %>%
  summarise(num_fires = n()) %>%
  arrange(desc(num_fires))
```
**The 2 highest days of the year are 185 and 186, which are Independence Day (4th July) in a normal year and a leap year respectively.  So I imagine most of the extra fires (literally over double the normal amount) are caused by fireworks.**
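
As a quick check of the day numbers, the sketch below works out which day of the year 4th July falls on for every year in the reporting period, using only base R. The variable names are made up for the illustration.

```{r}
# Day of year of 4 July for every year in the reporting period (1992-2015)
july4 <- as.Date(paste0(1992:2015, "-07-04"))
july4_doy <- as.integer(format(july4, "%j"))

# Number of years in which 4 July falls on each day of year
table(july4_doy)
```

Only days 185 and 186 appear, with 186 occurring in the 6 leap years of the period (1992, 1996, 2000, 2004, 2008 and 2012).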



### Fires by month of year

```{r}
fires_small %>%
  group_by(discovery_moy) %>%
  summarise(num_fires = n()) %>%
  ggplot(aes(x = discovery_moy, y = num_fires)) +
  geom_col(fill = "dark blue", col = "white")
```

**There are 2 definite peaks during the year.  March and April are probably due to the US "Spring Break", when schools and universities are closed and so families are likely to be on vacation, possibly visiting National Parks.  July and August are the summer break for schools, so both families visiting parks and hot weather are likely causes of fire outbreaks.**




### Fires by cause

```{r}
options(scipen = 999)

fires_small %>%
  group_by(stat_cause_descr) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(reorder(x = stat_cause_descr, num_fires), y = num_fires) +
  geom_col(fill = "dark blue") + 
  coord_flip() 
```


### Fire avg size by cause

```{r}
fires_small %>%
  group_by(stat_cause_descr) %>%
  summarise(avg_size = mean(fire_size)) %>%
  ggplot +
  aes(reorder(x = stat_cause_descr, avg_size), y = avg_size) +
  geom_col(fill = "dark blue") + 
  coord_flip()
```


### Avg burn time by cause

```{r}
fires_small %>%
  summarise(num_na = sum(is.na(cont_date)))
```
*Literally half the data is missing for burn time, making it very difficult to do any meaningful analysis*



### Fires by size


```{r}
fires_small %>%
  group_by(fire_size_class) %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = fire_size_class, y = num_fires, fill = fire_size_class) +
  geom_col() +
  scale_fill_manual(values = c("red", "orange", "yellow", "green", "blue", 
                               "purple", "black"),
                    name = "Fire Size Classification",
                    breaks = c("A", "B", "C", "D", "E", "F", "G"),
                    labels = c("A: < 1/4 acre", "B: 1/4 to 10 acres", "C: 10 to 100 acres",
                               "D: 100 to 300 acres", "E: 300 to 1000 acres",
                               "F: 1000 to 5000 acres", "G: More than 5000 acres"))

```



# Geo Spatial wrangling 


### To make it easier to visually compare the frequency of wildfires between states I want to display it in a map format.  As I'm using ggplot2 already I'm going to also use it for maps with the `geom_polygon()` and `coord_map()` functions, along with `theme_map()` from ggthemes.


#### I'm not entirely sure what geo-spatial information is being held within the sqlite database file; I've made a few attempts to retrieve it but have been unsuccessful.  Therefore I'm going to use ggplot2's `map_data()` function, which pulls the coordinates for state boundaries from the `maps` package, together with the `datasets` package, which includes various bits of information on the US States.


```{r}
# State boundary co-ordinates via ggplot2::map_data() (needs the 'maps' package installed)

state_map <- map_data("state")
state_map
```


#### Annoyingly it doesn't have the abbreviation of the State, only the full name so I need to add that in.  Luckily the 'datasets' package also has a vector of States names and abbreviations so I shall make a tibble with them both in.


```{r}
state.abb
```

```{r}
state.name
```

```{r}
state_list <- tibble(state = state.abb, state_name = state.name)
state_list
```


#### The `state_map` dataframe is in lower case and has the column name 'region'.  I shall change the `state_list` tibble to be the same format so they can be joined together.


```{r}
state_list <- tibble(state = state.abb, region = tolower(state.name))
```


#### Joining `state_list` to `fires_small` datasets

```{r}
fires_states <- fires_small %>%
  left_join(state_list, by = "state")

fires_states
```


#### Checking the join has worked and there are no missing values.

```{r}
fires_states %>%
  filter(is.na(region))
```


#### There are 22,147 NAs in the 'region' column we just made.  Scrolling through, 2 State codes are missing from the `state_list` tibble: 'PR' and 'DC'.

#### After some quick research it seems that there are only 50 States in the US. Washington DC is technically not counted as a State but as a Federal District, as it is the seat of government, which is why it wasn't included in the `state_list` tibble originally.  PR is Puerto Rico, which is also not a State but the largest US territory.
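
Rather than scrolling, the unmatched codes can also be listed directly with `setdiff()`. The sketch below uses a made-up vector, `toy_states`, in place of the real column; on the notebook's data the equivalent would presumably be `setdiff(unique(as.character(fires_small$state)), datasets::state.abb)`.

```{r}
# Toy vector of state codes (made up for illustration)
toy_states <- c("CA", "TX", "DC", "GA", "PR", "CA", "PR")

# Codes present in the data but absent from the built-in 50-state vector
missing_codes <- setdiff(unique(toy_states), datasets::state.abb)
missing_codes
```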


#### I shall add DC and PR into the state_list and re-join it.

```{r}
# Adding 2 new states

state.abb <- append(state.abb, c("DC", "PR"))
state.name <- append(state.name, c("District of Columbia", "Puerto Rico"))

state_list <- tibble(state = state.abb, region = tolower(state.name))
```


```{r}
# Re-joining tibbles

fires_states <- fires_small %>%
  left_join(state_list, by = "state")
```


```{r}
# Checking the join has worked properly and there are no NAs

fires_states %>%
  filter(is.na(region))
```

```{r}
# Code below brings up a "vector memory exhausted (limit reached?)" error

# fires_joined <- fires_states %>%
#  right_join(state_map, by = "region")
```


#### The data set and geo information are too big to join directly, so I'm going to summarise first to get the number of fires per region.

```{r}
fires_joined <- fires_states %>% 
    select(region) %>%
    group_by(region) %>%
    summarise(num_fires = n()) %>%
    right_join(state_map, by = "region")
```
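
The reason summarising first helps can be seen on a toy example: joining raw fire records to the map repeats every fire once per boundary vertex of its state, whereas joining per-region counts keeps the result the same size as the map. The data frames below are made up for illustration and use base R `merge()` rather than the dplyr join used in the notebook.

```{r}
# Toy data (made up): 6 fire records and a 5-vertex outline per region
toy_fires <- data.frame(region = c("a", "a", "a", "a", "b", "b"))
toy_map   <- data.frame(region = rep(c("a", "b"), each = 5),
                        long = 1:10, lat = 1:10)

# Joining raw records: each fire is repeated for every outline vertex
rows_raw <- nrow(merge(toy_fires, toy_map, by = "region"))   # 4*5 + 2*5 rows

# Summarising first: one row per region, so the join stays the size of the map
toy_counts <- as.data.frame(table(region = toy_fires$region),
                            responseName = "num_fires")
rows_summarised <- nrow(merge(toy_counts, toy_map, by = "region"))

c(raw = rows_raw, summarised = rows_summarised)
```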

**Result!!  Now doing first geo spatial visualisation**


### Total Wildfires per state from 1992-2015

```{r}
fires_joined %>% 
    ggplot +
    (aes(x = long, y = lat, group = group, fill = num_fires)) + 
    geom_polygon() + 
    geom_path(color = "white") + 
    scale_fill_continuous(low = "darkblue", 
                          high = "darkred",
                          name = "Number of fires") + 
    theme_map() + 
    coord_map("mollweide") + 
    ggtitle("Total US Wildfires from 1992-2015") + 
    theme(plot.title = element_text(hjust = 0.5))
```


# 3. Geo Spatial Visualisations

### The dataset has a cause of fire column so I shall now create some causation plots.


#### Getting list of fire causes

```{r}
fires_states %>%
  distinct(stat_cause_descr) %>%
  arrange(stat_cause_descr)
```


### Total fire by cause in tabular form

```{r}
fires_states %>%
  select(stat_cause_descr) %>%
  group_by(stat_cause_descr) %>%
  summarise(num_fires = n()) %>%
  arrange(desc(num_fires))
  
```

### Number of fires by state in tabular form

```{r}
fires_states %>%
  select(region) %>%
  group_by(region) %>%
  summarise(num_fires = n()) %>%
  arrange(desc(num_fires))
```


#### As the cause needs to be filtered before the map join, I'm either going to have to repeat a whole load of the same code in every single plot or write a function that will do it for me, saving a lot of typing!

```{r}
# Function for plotting cause of fire

cause <- function(cause) {
  fires_states %>%
    filter(stat_cause_descr == cause) %>%
    select(region) %>%
    group_by(region) %>%
    summarise(num_fires = n()) %>%
    right_join(state_map, by = "region") %>%
    ggplot +
    (aes(x = long, y = lat, group = group, fill = num_fires)) + 
    geom_polygon() + 
    geom_path(color = "white") + 
    scale_fill_continuous(low = "darkblue", 
                          high = "darkred",
                          name = "Number of fires") + 
    theme_map() + 
    coord_map("mollweide") + 
    ggtitle(paste0("Total US Wildfires caused by ", cause, " from 1992-2015")) + 
    theme(plot.title = element_text(hjust = 0.5))
}
```



### Wildfires caused by Arson

```{r}
cause("Arson")
```

**Arson does seem more prevalent in the SE states of Mississippi, Georgia, Alabama and also the western state of California.**


### Wildfires caused by Campfire

```{r}
cause("Campfire")
```

**Campfires are the most prevalent in the Western states of Oregon, California and Arizona.**


### Wildfires caused by Children

```{r}
cause("Children")
```

**Fires by children are spread about the country, but the most prevalent states are California in the west, and Alabama, South Carolina and New Jersey in the east.**


### Wildfires caused by Debris Burning

```{r}
cause("Debris Burning")
```

**Fires by burning debris are mostly in the southern warmer states of Texas, Georgia and North Carolina.**

### Wildfires caused by Equipment Use

```{r}
cause("Equipment Use")
```

**Most of the fires caused by equipment seem to be in California**


### Wildfires caused by Fireworks

```{r}
cause("Fireworks")
```

**Most of the fires caused by fireworks seem to be in the north of the country, primarily South Dakota, Montana and Washington state.**


### Wildfires caused by Lightning

```{r}
cause("Lightning")
```

**Apart from a hotspot of lightning strikes in Florida, the vast majority of fires caused by lightning are in the west of the country, with the 3 most affected states being California, Oregon and Arizona.**

### Wildfires caused by Miscellaneous

```{r}
cause("Miscellaneous")
```

**There seems to be quite a few miscellaneous classifications in California, Texas and New York.**


### Wildfires caused by Missing/Undefined

```{r}
cause("Missing/Undefined")
```

**The states with the most missing or undefined data are North and South Carolina, Oklahoma and California.**


### Wildfires caused by Powerline

```{r}
cause("Powerline")
```

**Texas has the largest number of wildfires caused by powerlines.  This is likely due to the warm climate and the large proportion of the state that is dry grassland used for agriculture. (1)**

(1) https://uk.reuters.com/article/us-wildfires-texas/trees-and-power-lines-caused-major-texas-fire-idUSTRE78J76A20110920


### Wildfires caused by Railroad

```{r}
cause("Railroad")
```

**By far Florida has the most wildfires caused by railroads.**


### Wildfires caused by Smoking

```{r}
cause("Smoking")
```

**Fires caused by smoking seem to be spread around the country, but mainly on the east and west coasts.**


### Wildfires caused by Structure

```{r}
cause("Structure")
```

**South Dakota has the largest proportion of fires caused by structures.**



#### Unsurprisingly the southern states seem to have more occurrences of wildfire in general, no doubt due to the warmer climate at their latitudes.  Also the 1st and 3rd states with the highest number of fires are the 2 largest of the contiguous States by size. However the 2nd highest State is Georgia, which, although it is in the South of the country, is only an average sized State.  Therefore to get a better picture of what is going on I'm going to look at the number of fires per square mile, normalising by State size.

#### The `datasets` package also has the area in square miles of each state in the `state.area` vector.

```{r}
state.area
```

```{r}
length(state.area)
```

#### Annoyingly it also only has 50 states, not 52, so I will need to add DC and PR back in.

(Area figures obtained from Wikipedia)

DC = 68 miles^2
PR = 3515 miles^2


```{r}
# To make my life easier I'm going to rebuild the vectors from the untouched
# originals in the `datasets` package and make the tibble again, adding in the
# area figures at the same time to make sure they are in the correct order.
# Using datasets:: means the chunk can be re-run without appending DC and PR twice.

state.abb <- append(datasets::state.abb, c("DC", "PR"))
state.name <- append(datasets::state.name, c("District of Columbia", "Puerto Rico"))
# Areas are appended as numbers so the vector stays numeric
state.area <- append(datasets::state.area, c(68, 3515))

state_list <- tibble(state = state.abb, region = tolower(state.name), area = state.area)
```

```{r}
# Re-joining tibbles

fires_states <- fires_small %>%
  left_join(state_list, by = "state")
```


### Normalising States area sizes

```{r}
fires_states %>%
  select(region, area) %>%
  group_by(region, area) %>%
  summarise(num_fires = n()) %>%
  mutate(fires_sqmile = num_fires / area) %>%
  arrange(desc(fires_sqmile))
```

#### This table shows Puerto Rico has the highest proportion of fires compared to its size, followed by New Jersey in the NE of the country and finally by the States in the SE of the country.


```{r}
fires_states %>%
  select(region, area) %>%
  group_by(region, area) %>%
  summarise(num_fires = n()) %>%
  mutate(fires_sqmile = num_fires / area) %>%
  right_join(state_map, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = fires_sqmile)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  scale_fill_distiller(name = "Fire per Sq Mile", palette = "PuBuGn") +
  theme_map() + 
  coord_map("mollweide") + 
  ggtitle(paste0("Total US Wildfires per Square Mile from 1992-2015")) + 
  theme(plot.title = element_text(hjust = 0.5))
```

**Puerto Rico is not shown on this map, and neither are Alaska and Hawaii, as `map_data("state")` only covers the contiguous states and DC.  For the regions that are shown, the south-eastern states still have the highest proportion of wildfires.  Interestingly, New Jersey also shows up as a hotspot in the NE of the country.**


### Do causes change over time?


#### Splitting causes into 2 groups for legibility. 

#### The first group is for directly man-made fires.

```{r}
fires_states %>%
  select(stat_cause_descr, fire_year) %>%
  group_by(fire_year, stat_cause_descr) %>%
  filter(stat_cause_descr == "Arson" | stat_cause_descr == "Campfire" |
           stat_cause_descr == "Children" | stat_cause_descr == "Equipment Use" |
           stat_cause_descr == "Fireworks" | stat_cause_descr == "Smoking") %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = fire_year, y = num_fires, colour = stat_cause_descr) +
  geom_line()
```

**The 2 large peaks in Arson are obvious in 1999 and 2006. There was a large heatwave in 2006, but I'm not sure why this would result in an increase in arson.  Unless this was just due to the dry ground creating extra fuel to aid the spread of fires that would have normally not resulted in a large scale fire.  This may also be the same reason that there is also another peak in 2006 for Equipment Use.  Arson however does look to be decreasing since 2006.**


#### And this one is for naturally occurring and other fires.

```{r}
fires_states %>%
  select(stat_cause_descr, fire_year) %>%
  group_by(fire_year, stat_cause_descr) %>%
  filter(stat_cause_descr == "Debris Burning" | stat_cause_descr == "Lightning" |
           stat_cause_descr == "Miscellaneous" | stat_cause_descr == 
           "Missing/Undefined" | stat_cause_descr == "Powerline" | 
           stat_cause_descr == "Railroad" | stat_cause_descr == "Structure") %>%
  summarise(num_fires = n()) %>%
  ggplot +
  aes(x = fire_year, y = num_fires, colour = stat_cause_descr) +
  geom_line()
```

**Similar peaks can be seen in debris burning, miscellaneous and lightning in the heatwave of 2006 that left the ground very dry.  There are peaks from 1997 to 2003 in debris burning, miscellaneous and lightning, but also a trough in missing/undefined, so this is likely to be due to more accurate classification of fires and less use of the missing/undefined category.**



### Difference in causes between states


```{r}
state_map_southern <- state_map %>%
  filter(region == "florida" | region == "georgia" | region == "alabama" |
           region == "mississippi" | region == "south carolina" | 
           region == "north carolina" | region == "tennessee" | 
           region == "arkansas" | region == "louisiana")
```
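
The period maps that follow repeat the same filter, count and top-value pipeline with different years. One possible way to factor that out is sketched below. The helper `top_per_region()` is hypothetical and not part of the original notebook, the `toy` tibble is made up, and `slice_max()` (dplyr 1.0 or later) is swapped in for `top_n()`.

```{r}
library(dplyr)

# Hypothetical helper: most common value of `col` per region within `years`
top_per_region <- function(data, years, col) {
  data %>%
    filter(fire_year %in% years) %>%
    count(region, {{ col }}, name = "num_fire") %>%
    group_by(region) %>%
    slice_max(num_fire, n = 1, with_ties = TRUE) %>%
    ungroup()
}

# Toy data (made up) to show the shape of the result
toy <- tibble(
  fire_year = c(1992, 1993, 1993, 1994, 1995, 1996),
  region = c("florida", "florida", "florida", "georgia", "georgia", "georgia"),
  stat_cause_descr = c("Railroad", "Railroad", "Lightning",
                       "Arson", "Arson", "Lightning")
)

toy_top <- top_per_region(toy, 1992:1995, stat_cause_descr)
toy_top
```

The result has one row per region (more if there are ties) and could then be passed to `right_join(state_map_southern, by = "region")` in the same way as in the chunks below.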


```{r}
fires_states %>%
  filter(fire_year == "1992" | fire_year == "1993" | fire_year == "1994" |
           fire_year == "1995") %>%
  filter(region == "florida" | region == "georgia" | region == "alabama" |
           region == "mississippi" | region == "south carolina" | 
           region == "north carolina" | region == "tennessee" |
           region == "arkansas" | region == "louisiana") %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 1992-1995") + 
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  filter(fire_year == "1996" | fire_year == "1997" | fire_year == "1998" |
           fire_year == "1999") %>%
  filter(region == "florida" | region == "georgia" | region == "alabama" |
           region == "mississippi" | region == "south carolina" | 
           region == "north carolina" | region == "tennessee" |
           region == "arkansas" | region == "louisiana") %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  top_n(1) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot +
  (aes(x = long, y = lat, group = group, fill = stat_cause_descr)) + 
  geom_polygon() + 
  geom_path(color = "white") + 
  theme_map() + 
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Total US Wildfires main cause from 1996-1999") + 
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  # Keep one four-year period and the nine southern states
  filter(fire_year %in% as.character(2000:2003)) %>%
  filter(region %in% c("florida", "georgia", "alabama", "mississippi",
                       "south carolina", "north carolina", "tennessee",
                       "arkansas", "louisiana")) %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  # Keep the most frequent cause in each state
  top_n(1, num_fire) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot(aes(x = long, y = lat, group = group, fill = stat_cause_descr)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Main cause of wildfires in the southern states, 2000-2003") +
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  # Keep one four-year period and the nine southern states
  filter(fire_year %in% as.character(2004:2007)) %>%
  filter(region %in% c("florida", "georgia", "alabama", "mississippi",
                       "south carolina", "north carolina", "tennessee",
                       "arkansas", "louisiana")) %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  # Keep the most frequent cause in each state
  top_n(1, num_fire) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot(aes(x = long, y = lat, group = group, fill = stat_cause_descr)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Main cause of wildfires in the southern states, 2004-2007") +
  theme(plot.title = element_text(hjust = 0.5))
```



```{r}
fires_states %>%
  # Keep one four-year period and the nine southern states
  filter(fire_year %in% as.character(2008:2011)) %>%
  filter(region %in% c("florida", "georgia", "alabama", "mississippi",
                       "south carolina", "north carolina", "tennessee",
                       "arkansas", "louisiana")) %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  # Keep the most frequent cause in each state
  top_n(1, num_fire) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot(aes(x = long, y = lat, group = group, fill = stat_cause_descr)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Main cause of wildfires in the southern states, 2008-2011") +
  theme(plot.title = element_text(hjust = 0.5))
```



```{r}
fires_states %>%
  # Keep one four-year period and the nine southern states
  filter(fire_year %in% as.character(2012:2015)) %>%
  filter(region %in% c("florida", "georgia", "alabama", "mississippi",
                       "south carolina", "north carolina", "tennessee",
                       "arkansas", "louisiana")) %>%
  select(region, stat_cause_descr) %>%
  group_by(region, stat_cause_descr) %>%
  summarise(num_fire = n()) %>%
  # Keep the most frequent cause in each state
  top_n(1, num_fire) %>%
  right_join(state_map_southern, by = "region") %>%
  ggplot(aes(x = long, y = lat, group = group, fill = stat_cause_descr)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Cause of Fires", palette = "PuBuGn") +
  ggtitle("Main cause of wildfires in the southern states, 2012-2015") +
  theme(plot.title = element_text(hjust = 0.5))
```

**Looking at these trends, some interesting insights emerge. In the combined-years data, Florida stands out as having railroad as its main cause of wildfire, but the plots above show that railroad fires are the main cause only up to the four-year period ending in 2003; after that, the main cause is lightning through to the end of the collection period in 2015. Similarly, arson is a reasonably common main cause in the southern states until 2007, after which it no longer appears as the most common cause of wildfire. This downward trend was also noted earlier in the overall causation plots for all states.**
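
The period-by-period comparison above can also be read off a single table rather than six maps. The chunk below is a minimal sketch of that idea, not the method used for the plots: `toy_fires` is a made-up stand-in for `fires_states` (the counts are invented), and `top_cause_by_period()` is a hypothetical helper. It assumes periods are consecutive four-year blocks starting in 1992 and, unlike `top_n()`, it breaks ties by keeping only one cause per state and period.

```{r}
library(dplyr)

# Toy stand-in for fires_states; values are invented for illustration only
toy_fires <- tibble(
  region = c("florida", "florida", "florida", "florida", "florida",
             "georgia", "georgia", "georgia"),
  fire_year = c(1992, 1993, 1994, 2004, 2005, 1992, 1993, 2005),
  stat_cause_descr = c("Railroad", "Railroad", "Lightning", "Lightning",
                       "Lightning", "Arson", "Arson", "Debris Burning")
)

# Hypothetical helper: most frequent cause per state and multi-year period
top_cause_by_period <- function(data, period_length = 4, first_year = 1992) {
  data %>%
    mutate(
      # Accept fire_year stored as number, character or factor
      fire_year = as.integer(as.character(fire_year)),
      # First year of the block that each fire falls into
      period_start = first_year +
        period_length * ((fire_year - first_year) %/% period_length),
      period = paste(period_start, period_start + period_length - 1, sep = "-")
    ) %>%
    count(region, period, stat_cause_descr, name = "num_fire") %>%
    group_by(region, period) %>%
    # Keep a single top cause per group, dropping ties
    slice_max(num_fire, n = 1, with_ties = FALSE) %>%
    ungroup()
}

top_causes <- top_cause_by_period(toy_fires)
top_causes
```

Applied to the real data, a table like this would show one row per state and period, which makes a switch such as Florida's move from railroad to lightning visible at a glance.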



### Correlation between states and fire size

```{r}
fires_states %>%
  select(region, fire_size_class) %>%
  group_by(region, fire_size_class) %>%
  summarise(num_fire = n()) %>%
  top_n(1, num_fire) %>%
  right_join(state_map, by = "region") %>%
  ggplot(aes(x = long, y = lat, group = group, fill = fire_size_class)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Fire Size Class", palette = "PuBuGn") +
  ggtitle("Most common wildfire size per State 1992-2015") +
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  select(region, fire_size_class) %>%
  filter(fire_size_class == "G") %>%
  group_by(region) %>%
  summarise(num_fire = n()) %>%
  right_join(state_map, by = "region") %>%
  ggplot(aes(x = long, y = lat, group = group, fill = num_fire)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_distiller(name = "Number of Fires", palette = "PuBuGn") +
  ggtitle("Number of large class G fires per State 1992-2015") +
  theme(plot.title = element_text(hjust = 0.5))
```

**From the plots we can see that the western states have the most small fires and also the most large fires! Not the most helpful plots...**
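
One possible reason the two maps say so little is that both are driven by raw counts, so states with many fires overall dominate each of them. A hedged alternative is to look at the share of a state's fires that are class G instead. The chunk below sketches that calculation on `toy_sizes`, an invented stand-in for `fires_states`; the column names `num_class_g` and `share_class_g` are assumptions introduced here.

```{r}
library(dplyr)

# Toy stand-in for fires_states; values are invented for illustration only
toy_sizes <- tibble(
  region = c(rep("california", 6), rep("nevada", 3)),
  fire_size_class = c("A", "A", "A", "B", "G", "G", "A", "G", "G")
)

class_g_share <- toy_sizes %>%
  group_by(region) %>%
  summarise(
    num_fire = n(),                             # all fires in the state
    num_class_g = sum(fire_size_class == "G"),  # class G fires only
    share_class_g = num_class_g / num_fire      # proportion of class G fires
  )

class_g_share
```

In this toy example both states have two class G fires, but the share separates them. Mapping `share_class_g` with `scale_fill_distiller()` in place of `num_fire` would be one way to try this on the real data.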



### Are fires more prevalent in certain months for individual states?

```{r}
fires_states %>%
  filter(fire_year %in% as.character(1992:1995)) %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  # Keep the month with the most fires in each state
  top_n(1, num_fire) %>%
  right_join(state_map, by = "region") %>%
  ggplot(aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 1992-1995") +
  theme(plot.title = element_text(hjust = 0.5))
```



```{r}
fires_states %>%
  filter(fire_year %in% as.character(1996:1999)) %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  # Keep the month with the most fires in each state
  top_n(1, num_fire) %>%
  right_join(state_map, by = "region") %>%
  ggplot(aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 1996-1999") +
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  filter(fire_year %in% as.character(2000:2003)) %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  # Keep the month with the most fires in each state
  top_n(1, num_fire) %>%
  right_join(state_map, by = "region") %>%
  ggplot(aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 2000-2003") +
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  filter(fire_year %in% as.character(2004:2007)) %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  # Keep the month with the most fires in each state
  top_n(1, num_fire) %>%
  right_join(state_map, by = "region") %>%
  ggplot(aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 2004-2007") +
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  filter(fire_year %in% as.character(2008:2011)) %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  # Keep the month with the most fires in each state
  top_n(1, num_fire) %>%
  right_join(state_map, by = "region") %>%
  ggplot(aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 2008-2011") +
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  filter(fire_year %in% as.character(2012:2015)) %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  # Keep the month with the most fires in each state
  top_n(1, num_fire) %>%
  right_join(state_map, by = "region") %>%
  ggplot(aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 2012-2015") +
  theme(plot.title = element_text(hjust = 0.5))
```


```{r}
fires_states %>%
  # Six-year window covering the most recent years in the data
  filter(fire_year %in% as.character(2010:2015)) %>%
  select(region, discovery_moy) %>%
  group_by(region, discovery_moy) %>%
  summarise(num_fire = n()) %>%
  # Keep the month with the most fires in each state
  top_n(1, num_fire) %>%
  right_join(state_map, by = "region") %>%
  ggplot(aes(x = long, y = lat, group = group, fill = discovery_moy)) +
  geom_polygon() +
  geom_path(color = "white") +
  theme_map() +
  scale_fill_brewer(name = "Months of Year", palette = "PuBuGn") +
  ggtitle("Month with most fires per State 2010-2015") +
  theme(plot.title = element_text(hjust = 0.5))
```


**The plots above are quite interesting. The month with the most fires varies widely between states. Broadly, the eastern half of the country has the most fires in spring (Feb-May), while the western part has the most fires later, in summer and fall (Jun-Oct). There are a few exceptions: in the 2004-2007 and 2008-2011 data, Texas has the most fires in January. Florida also mostly conformed to the east/west split, with its worst month for fires falling in March or April up until 2007; the most common month then moves later, to June and July, for the rest of the reporting period until 2015. This may be due to the main cause of fires in Florida changing from railroad to lightning at about the same time, as noted earlier when looking at causation. As July is the main month for tropical storms and lightning in Florida, this is a possible reason for the peak month moving later in the year. (2)**


(2) https://www.weather.gov/mlb/fl_lightning_climo
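
The suggested link between Florida's peak month and its main cause could be checked by putting the two side by side for each period. The chunk below is only a sketch under assumptions: `toy_florida` is an invented stand-in for the Florida rows of `fires_states`, `most_common()` is a hypothetical helper, and ties are broken by keeping one value per period.

```{r}
library(dplyr)

# Toy stand-in for the Florida rows of fires_states; values are invented
toy_florida <- tibble(
  fire_year = c(1992, 1992, 1993, 2008, 2008, 2009),
  discovery_moy = c("Mar", "Mar", "Apr", "Jun", "Jul", "Jun"),
  stat_cause_descr = c("Railroad", "Railroad", "Lightning",
                       "Lightning", "Lightning", "Arson")
)

period_labels <- c("1992-1995", "1996-1999", "2000-2003",
                   "2004-2007", "2008-2011", "2012-2015")

# Assign each fire to its four-year period; intervals are (1991, 1995], ...
with_period <- toy_florida %>%
  mutate(period = cut(fire_year, breaks = seq(1991, 2015, by = 4),
                      labels = period_labels))

# Hypothetical helper: most frequent value of `column` within each period
most_common <- function(data, column) {
  data %>%
    count(period, {{ column }}, name = "num_fire") %>%
    group_by(period) %>%
    slice_max(num_fire, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    select(period, {{ column }})
}

# One row per period: peak month next to main cause
comparison <- inner_join(
  most_common(with_period, discovery_moy),
  most_common(with_period, stat_cause_descr),
  by = "period"
)

comparison
```

If the real data showed the peak month and the main cause changing in the same period, that would be consistent with the explanation above, although it would not on its own establish that one drives the other.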


